SYSTEM AND METHOD FOR CONFIGURING ADAPTIVE
SETS OF LINKS BETWEEN ROUTERS IN A 
SYSTEM AREA NETWORK (SAN) 



CROSS-REFERENCES TO RELATED APPLICATIONS

This application is a continuation-in-part of Application Serial Nos. 
09/224,114 filed December 30, 1998 and 09/228,069, filed December 30, 1998, the 
disclosures of which are incorporated herein by reference. 

BACKGROUND OF THE INVENTION

A System Area Network (SAN) is used to interconnect nodes within a 
distributed computer system, such as a cluster. The SAN is a type of network that 
provides high bandwidth, low latency communication with a very low error rate. SANs 
often utilize fault-tolerant technology to assure high availability. The performance of a 
SAN resembles a memory subsystem more than a traditional local area network (LAN).

The preferred embodiments will be described as implemented in the
ServerNet architecture, manufactured by the assignee of the present invention, which is a
layered transport protocol for a System Area Network (SAN). The ServerNet II protocol
layers for an end node and for a routing node are illustrated in Figure 1. A single session
layer may support one or two ports, each with its associated transaction, packet,
link-level, MAC (media access) and physical layer. Similarly, routing nodes with a common
routing layer may support multiple ports, each with its associated link-level, MAC and
physical layer.



Support for two ports enables the ServerNet SAN to be configured in both
non-redundant and redundant (fault tolerant, or FT) SAN configurations as illustrated in
Figure 2 and Figure 3. On a fault tolerant network, a port of each end node may be
connected to each network to provide continued message communication in the event of
failure of one of the SANs. In the fault tolerant SAN, nodes may also be ported into a
single fabric, or single ported end nodes may be grouped into pairs to provide duplex FT
controllers. The fabric is the collection of routers, switches, connectors, and cables that
connects the nodes in a network.

The SAN includes end nodes and routing nodes connected by physical
links. End nodes generate and consume data packets. Routing
nodes never generate or consume data packets but simply pass the packets along from the
source end node to the destination end node.

Each node includes bidirectional ports connected to the physical link. A 
link layer protocol (LLP) manages the flow of status and packet data between ports on 
independent nodes. 



The ServerNet SAN has been enhanced to improve performance. The
original ServerNet configuration is designated SNet I and the improved configuration is
designated SNet II. Among the improvements implemented in the SNet II SAN are a higher
transfer rate and a different symbol encoding. Links between SNet II end nodes have a data
transfer rate of 125 MB/s. Future CPUs and I/O devices will require much faster data
transfer rates. However, significantly increasing the link transfer rate would require
discontinuing use of low-cost commodity serial links such as the 1.25 Gbit serial links
common to Ethernet.



SUMMARY OF THE INVENTION

According to one aspect of the invention, an adaptive set is a plurality of
physical links connecting a pair of routers. The multiple links of the adaptive set are
called lanes. The router includes logic for adaptively routing packets received at an input
port to the various lanes. A source end node controls whether packets destined for the
router are routed deterministically or adaptively by encoding control bits in the packet
header. The adaptive set configuration allows the use of commodity serial links while
allowing for unusual bandwidth needs and future scalability.

According to another aspect of the invention, the control bits may specify
that a packet be routed through a particular lane in an adaptive set.

According to another aspect of the invention, all lanes of an adaptive set
can be flushed by encoding the control bits in flush packets to sequentially flush all lanes
of the adaptive set.

According to a still further aspect of the invention, the number of lanes
that can be included in an adaptive set is limited to a particular number. During a flush,
packets sequence through the particular number of lanes.

According to a still further aspect of the invention, uplinks from a
particular router in a lower level of a fat tree topology are configured as an adaptive set.
These links are coupled to different routers in an upper layer so that packets are
distributed adaptively from a particular router in the lower level to multiple routers in the
upper layer.

Additional advantages and features of the invention will be apparent in
view of the following detailed description and appended drawings.



BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a block diagram depicting ServerNet protocol layers implemented
by hardware, where ServerNet is a SAN manufactured by the assignee of the present
invention;

Figs. 2 and 3 are block diagrams depicting SAN topologies;

Fig. 4 is a schematic diagram depicting routers and links connecting SAN
end nodes;

Fig. 5 is a block diagram of a router;

Fig. 6 is a table of physical link to physical lane translation;

Fig. 7 is a block diagram depicting the contents of a packet header;

Fig. 8 is a block diagram depicting the contents of the destination field;

Fig. 9 is a table defining the encoding of the adaptive control bits (ACB);

Fig. 10 is a flow chart of link to lane translation and back again;

Fig. 11 is a schematic diagram depicting the use of adaptive sets as uplinks
in a fat tree; and

Fig. 12 is a schematic diagram depicting the downlinks in a fat tree.



DESCRIPTION OF THE SPECIFIC EMBODIMENTS 

A preferred embodiment of the invention will now be described in the
context of the ServerNet (SNet) system area network (SAN). SNet I and SNet II are
scalable networks that support read, write, and interrupt semantics similar to previous
generations of I/O buses and are manufactured and distributed by the assignee of the
present invention. The ServerNet I system is described in U.S. Patent No. 5,675,807,
which is assigned to the assignee of the present application.

Communication between nodes coupled to ServerNet is implemented by
forming and transmitting packetized messages that are routed from the transmitting, or
source, node to a destination node by a system area network structure comprising a
number of router elements that are interconnected by a bus structure of a plurality of
interconnecting links. The router elements are responsible for choosing the proper or
available communication paths from a transmitting component of the processing system
to a destination component based upon information contained in the message packet.
A router is an intelligent hub that routes traffic to a designated channel. In
a ServerNet SAN, the router is a twelve-way crossbar switch that interconnects all of the
ServerNet system components (processors, storage, and communications) for
unobstructed, high-speed data passing. Each link between routers has a maximum
bandwidth determined by the width of the link and the rate of data transfer. Bandwidth
may be increased by configuring multiple links between routers as a link set or "Adaptive
Set". Transfers that do not require strict ordering of packets may route each packet along
any available lane of the Adaptive Set.

Configuring multiple links to be part of an Adaptive Set allows for higher
bandwidth with little change to ServerNet hardware. For each packet, the router has to
decide which link of an Adaptive Set to use.

Fig. 4 depicts a network topology utilizing routers and links. In Fig. 4, end
nodes A-F, each having first and second send/receive ports 0 and 1, are coupled by a
ServerNet topology including routers R1-R4. Links are represented by lines coupling
ports to routers or routers to routers. A first Adaptive Set 2 couples routers R1 and R3
and a second Adaptive Set 4 couples routers R2 and R4.

Thus, port 0 of end node A, port 0 of end node D, ports 0 and 1 of end
node E, and port 0 of end node F may transfer data through the first Adaptive Set 2.

Fig. 5 is a block diagram of a router chip having twelve fully independent
input ports 6, each with an associated output port 8, a routing control block 10, a simple
packet interface for use with inband control messages 12, a fully non-blocking 13x13
crossbar 14, and an interface for JTAG test and microcontroller connections 16.

Each input module includes receive data synchronizers, elastic FIFOs 20
and 22, and flow control logic. Each input module passes the header information to
the routing module, which determines the appropriate target port for the packet. The routing
module also controls the selection of links in any Adaptive Sets, as will be described more
fully below.
ROUTER CONFIGURATION

A router includes routing and configuration logic to route an incoming
packet to the correct output port and to configure Adaptive Sets. The routing logic
includes a routing table having 1024 entries, each including a 4-bit port or Adaptive Set
specifier and a bit indicating whether the entry is for an Adaptive Set.
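
The routing-table entry just described can be pictured as a 5-bit value. The following Python sketch is illustrative only: the bit layout (flag in bit 4, specifier in bits 0 to 3), the helper names `pack_entry` and `unpack_entry`, and the sample destination indices are assumptions made for the example and are not taken from the ServerNet hardware.

```python
from typing import Tuple

ROUTING_TABLE_SIZE = 1024            # entries, per the description
SPECIFIER_BITS = 4                   # port or Adaptive Set specifier
ADAPTIVE_FLAG = 1 << SPECIFIER_BITS  # assumed position of the Adaptive Set bit


def pack_entry(specifier: int, is_adaptive_set: bool) -> int:
    """Pack a specifier and the Adaptive Set flag into one table entry."""
    if not 0 <= specifier < (1 << SPECIFIER_BITS):
        raise ValueError("specifier must fit in 4 bits")
    return specifier | (ADAPTIVE_FLAG if is_adaptive_set else 0)


def unpack_entry(entry: int) -> Tuple[int, bool]:
    """Return (specifier, is_adaptive_set) for a packed entry."""
    return entry & (ADAPTIVE_FLAG - 1), bool(entry & ADAPTIVE_FLAG)


# A table in which every destination initially maps to plain port 0.
routing_table = [pack_entry(0, False)] * ROUTING_TABLE_SIZE
routing_table[37] = pack_entry(1, True)    # hypothetical: destination 37 -> Adaptive Set 1
routing_table[38] = pack_entry(11, False)  # hypothetical: destination 38 -> physical port 11
```

A 4-bit specifier is enough to name any of the 12 ports or any of the 6 Adaptive Sets, which is why a single flag bit suffices to distinguish the two interpretations.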



As described above, in a preferred embodiment each router has 12 ports.
The following are the currently preferred Adaptive Set implementation restrictions:

• The maximum number of physical links in an Adaptive Set is 4.

• A maximum of 6 Adaptive Sets can be used (2 ports minimum per
Adaptive Set).

• A port can be in a maximum of one Adaptive Set (a port cannot be part of two
Adaptive Sets).

• There are no restrictions on which ports can be in a given Adaptive Set: any physical
port can be included in any one Adaptive Set.
Adaptive Set 

Logically, an Adaptive Set is composed of a plurality of lanes. Adaptive
Set configuration registers are used to translate the lane to a physical link.

Fig. 6 is a table illustrating the definition of two Adaptive Sets in a router
conforming to the above-listed restrictions. Adaptive Set 0 is defined to be composed of
three ports: 1, 6, and 9, and Adaptive Set 1 is defined to be composed of four ports: 5, 7, 8,
and 11. Fig. 6 shows the two Adaptive Sets, the physical links that compose the Adaptive
Sets, and a simple mapping of an Adaptive Set lane number into a given link of
an Adaptive Set.
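
The configuration restrictions and the link/lane translation can be sketched in a few lines of Python. This is a minimal illustration, not the register layout of the router: the dictionary representation, the function names, the 0 to 11 port numbering, and the choice of numbering lanes by position in each list are all assumptions.

```python
MAX_LINKS_PER_SET = 4
MAX_SETS = 6
NUM_PORTS = 12  # ports assumed to be numbered 0..11


def validate_adaptive_sets(sets):
    """Check a {set_id: [port, ...]} mapping against the listed restrictions."""
    if len(sets) > MAX_SETS:
        raise ValueError("at most 6 Adaptive Sets per router")
    seen = set()
    for set_id, ports in sets.items():
        if not 2 <= len(ports) <= MAX_LINKS_PER_SET:
            raise ValueError(f"set {set_id}: need 2 to 4 links")
        for port in ports:
            if not 0 <= port < NUM_PORTS:
                raise ValueError(f"set {set_id}: no such port {port}")
            if port in seen:
                raise ValueError(f"port {port} appears more than once")
            seen.add(port)
    return sets


def link_to_lane(sets, link):
    """Translate a physical link to (set_id, lane), or None if not in a set."""
    for set_id, ports in sets.items():
        if link in ports:
            return set_id, ports.index(link)
    return None


def lane_to_link(sets, set_id, lane):
    """Translate a lane of an Adaptive Set back to its physical link."""
    return sets[set_id][lane]


# The two Adaptive Sets of Fig. 6; lane numbers follow list position (assumed).
FIG6_SETS = validate_adaptive_sets({0: [1, 6, 9], 1: [5, 7, 8, 11]})
```

With this representation, `link_to_lane(FIG6_SETS, 6)` identifies link 6 as lane 1 of Adaptive Set 0, and `lane_to_link(FIG6_SETS, 1, 3)` maps lane 3 of Adaptive Set 1 back to link 11.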
PACKET ROUTING

As depicted in Fig. 7, each packet includes a header containing three fields 
which specify the destination of the packet (including routing information), the source of 
the packet (including packet type information), and control information. 

Fig. 8 depicts the contents of the destination field. The region and device 
bits are used to access the routing table and determine the correct output port for a 
received packet. The ACB (adaptive control bits) are used to alert the Adaptive Set logic 
on the router whether the packet could use the adaptive routing capabilities of the 
Adaptive Set or if the packet should be routed down a specific lane of the Adaptive Set. 

The encoding of the ACB bits is depicted in Fig. 9 where RFD denotes 
routing flow diagram. Note that the first four encodings specify ordered packet delivery 
so that a specified lane of the Adaptive Set is utilized and the adaptive routing capability 
is not utilized. The ordering of packets sent from a specific source to a specific 
destination cannot be assured if adaptive routing is used. 

When a packet enters the router, it flows through a routing flow diagram
(RFD) as depicted in Fig. 10. When a packet is received, the RFD designates a
preliminary port assignment (PPA) for the packet. If there were no Adaptive Set, the
packet would be routed to the PPA. The router determines if the PPA is part of an
Adaptive Set by comparing it with the static Adaptive Set definition (e.g., Fig. 6). If the
PPA is part of an Adaptive Set, then the PPA, which contains a physical link number, is
translated into a physical lane number of a particular Adaptive Set.

If the PPA is part of an Adaptive Set, then the ACB field is examined to
determine whether ordered packet delivery is specified. If so, the ACB field specifies the
offset value added to the lane number of the PPA to determine on which lane of the
Adaptive Set the packet should be routed. The router then checks to determine whether
the lane selected is on-line and finally converts from a lane number of a particular
Adaptive Set to a physical link of the router.

If one of the physical links of an Adaptive Set becomes unavailable due to
being taken off-line through link-level protocol errors, the Adaptive Set will reconfigure
itself so that the lost link is not used as part of the Adaptive Set until the link comes back
on-line. In the event that a packet is received that specifies ordered routing on a lane of
the Adaptive Set that has been taken off-line, then the packet will be routed on the next
link of that Adaptive Set that is active (not off-line).
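
The routing flow above, including the off-line fallback, can be summarized in a short Python sketch. It is a simplified model under stated assumptions: the function name `route`, the data structures, the wrap-around of the lane offset, and the random choice used for adaptive selection are all illustrative, since the description does not specify how the router picks a lane when adaptive routing is permitted.

```python
import random

ACB_UNORDERED = 0b100  # "Unordered Packet Delivery" encoding


def route(ppa, acb, sets, online, rng=random):
    """Return the physical output link for a packet.

    ppa    -- preliminary port assignment from the routing table
    acb    -- 3-bit adaptive control bits from the destination field
    sets   -- {set_id: [link, ...]}; list index is the lane number
    online -- set of links currently on-line
    """
    links = next((ls for ls in sets.values() if ppa in ls), None)
    if links is None:
        return ppa  # PPA is not part of any Adaptive Set

    if acb == ACB_UNORDERED:
        # Adaptive routing: any on-line lane may be used. A random pick
        # stands in for the router's real (unspecified) selection policy.
        candidates = [link for link in links if link in online]
        if not candidates:
            raise RuntimeError("no on-line lane in the Adaptive Set")
        return rng.choice(candidates)

    if not 0 <= acb <= 3:
        raise ValueError("ACB encoding not covered by this sketch")

    # Ordered delivery: the ACB is an offset added to the PPA's lane number.
    # Wrapping past the last lane is an assumption of this sketch.
    start = (links.index(ppa) + acb) % len(links)
    for step in range(len(links)):
        link = links[(start + step) % len(links)]
        if link in online:  # skip lanes that have been taken off-line
            return link
    raise RuntimeError("no on-line lane in the Adaptive Set")
```

For example, with the Fig. 6 sets and all links on-line, a packet whose PPA is link 1 and whose ACB is 2 is forced onto link 9 (lane 2 of Adaptive Set 0); if link 6 is off-line, a packet with PPA 1 and ACB 1 falls through to link 9, the next active link.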

Thus, although Adaptive Sets are defined at the router nodes, the source 
controls the use of the Adaptive Set by setting the ACB bits. An important result of the 
use of Adaptive Sets is that packets may arrive at the destination out of order. For 
example, the receive FIFOs of ports coupled to some of the output ports forming an
Adaptive Set may be full and not be accepting further packets (i.e., exerting back
pressure). Packets routed to these lanes of the Adaptive Set will be delayed while packets 
routed to other lanes will be transmitted immediately. Thus, at the router, earlier received 
packets routed to a lane experiencing back pressure will be transmitted after later received 
packets routed to a lane not experiencing back pressure. Accordingly, the packets will 
not be transmitted in the order received. 

In a preferred embodiment, a SEND transaction is implemented that 
requires strict ordering. This is necessary because the receiving node places the incoming 
packets into a scatter list. Each incoming packet goes to a destination determined by the 
sum total of bytes of the previous packets. The strict ordering of packets is necessary to 
preserve integrity of the entire block of data being transferred, because incoming packets 
are placed in consecutive locations within the block of data. For this transaction, the 
ACB bits in each packet header would specify the same lane of the Adaptive Set. Then,
if an Adaptive Set has been defined in the router, only a single link would be used, thereby
assuring ordered transmission.

On the other hand, a remote direct memory access (RDMA) transaction
does not require that packets be received in order. An RDMA packet contains the address
to which the destination end node writes the packet contents. This allows multiple
RDMA packets within an RDMA message to complete out of order. The contents of each
packet are written to the correct place in the end node's memory, regardless of the order
in which they complete. The RDMA may use adaptive routing if an Adaptive Set is
defined by setting the ACB field to 100 (Unordered Packet Delivery, see Fig. 9).
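
The difference between the two transaction types can be made concrete with a small Python sketch. It is a deliberately simplified model: the scatter list is reduced to one contiguous buffer, and the function names and sample payloads are hypothetical.

```python
def place_send(packets, size):
    """SEND-style placement: each payload lands at the running byte total."""
    buf = bytearray(size)
    offset = 0
    for payload in packets:
        buf[offset:offset + len(payload)] = payload
        offset += len(payload)
    return bytes(buf)


def place_rdma(packets, size):
    """RDMA-style placement: each packet carries its own target address."""
    buf = bytearray(size)
    for address, payload in packets:
        buf[address:address + len(payload)] = payload
    return bytes(buf)


message = [b"AAAA", b"BB", b"CCC"]
in_order = place_send(message, 9)          # b"AAAABBCCC"
reordered = place_send(message[::-1], 9)   # b"CCCBBAAAA": block is corrupted

addressed = [(0, b"AAAA"), (4, b"BB"), (6, b"CCC")]
rdma_reordered = place_rdma(addressed[::-1], 9)  # b"AAAABBCCC": order is irrelevant
```

Because the SEND placement depends on the byte count of all earlier packets, any reordering shifts every payload, whereas the addressed RDMA packets produce the same memory image in any arrival order.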

Thus, if an Adaptive Set is defined in the router, the source can control
whether routing is deterministic or adaptive through the use of the ACB bits in the
destination field.

ERROR RECOVERY AND BARRIER TRANSACTIONS

The ServerNet SAN recovers from errors by retransmitting packets
that were transmitted after the occurrence of an error. As described above,
packets that have been transmitted are stored in the receive and transmit FIFOs of the
routers in the fabric. Thus, prior to retransmission it must be assured that these stale
packets, i.e., packets transmitted after the error occurred, are flushed from all the FIFOs.
In the preferred embodiment, a path is flushed by performing a barrier transaction, which,
in the most general form, is a write of a particular value to the remote end node on the
path to be flushed followed by a read of the particular value from the remote node.
For each link, the barrier transaction packet will not reach the end node until all
stale packets preceding the barrier transaction have reached the end node. The end node
discards those packets received prior to the barrier transaction packet.

For deterministic routing the path is composed of serially connected links,
so the barrier transaction necessarily flushes all stale packets. However, if routers have
defined Adaptive Sets and adaptive routing is specified, then stale packets may reside in
all the parallel physical links which form the Adaptive Set.

The ACB offset bits allow the source to flush each lane of an Adaptive Set.
By using the first four forced ordering encodings of the ACB, all possible lanes of an
Adaptive Set may be selected for routing a packet. By stepping through these four
encodings (four being the maximum number of links in an Adaptive Set), all of the ports
that a packet can traverse when going between two end nodes can be flushed. For
software to flush out the path between two end nodes, the following algorithm should be
performed:

for i = 0 to 3
    Write location (ACB field = i);    // write portion of barrier operation
    Read location (ACB field = i);     // read portion of barrier operation

The index i is stepped from 0 to 3 because the maximum number of links
that compose an Adaptive Set is 4. When performing this algorithm, the software does not
need to know whether the routing network contains an Adaptive Set or the number of links
composing the Adaptive Set. The flush is successful only if each read function returns
the appropriate unique value for each i.
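
The flush loop can be exercised against a toy model of a path. The Python sketch below is only an illustration of the algorithm: the `SimulatedPath` class, its FIFO-per-lane model, the `stale` flag, the barrier location `0x100`, and the wrap-around used when a set has fewer than four lanes are all assumptions, not part of the ServerNet implementation.

```python
from collections import deque

MAX_LANES = 4  # maximum number of links in an Adaptive Set


class SimulatedPath:
    """Toy path with one Adaptive Set; each lane is a FIFO of pending writes."""

    def __init__(self, num_lanes):
        self.lanes = [deque() for _ in range(num_lanes)]
        self.memory = {}  # remote end node memory: location -> value

    def _lane(self, acb):
        # Forced-ordering encodings select a lane; wrap-around is assumed
        # when the set has fewer than four lanes.
        return self.lanes[acb % len(self.lanes)]

    def write(self, location, value, acb, stale=False):
        self._lane(acb).append((location, value, stale))

    def read(self, location, acb):
        # A read cannot pass packets queued ahead of it on the same lane,
        # so the lane drains in FIFO order; stale packets are discarded.
        lane = self._lane(acb)
        while lane:
            loc, value, stale = lane.popleft()
            if not stale:
                self.memory[loc] = value
        return self.memory.get(location)


def flush_path(path, location=0x100):
    """Run the barrier write/read once per forced-ordering ACB encoding."""
    for i in range(MAX_LANES):
        marker = ("barrier", i)  # unique value for each i
        path.write(location, marker, acb=i)
        if path.read(location, acb=i) != marker:
            return False
    return True
```

The caller never inspects the number of lanes: stepping i from 0 to 3 visits every lane of any set that obeys the four-link limit, which mirrors the point made above that software need not know the Adaptive Set configuration.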

The forced ordering encodings of the ACB allow thorough diagnostics of
Adaptive Set links, and allow each link of an Adaptive Set to be tested individually.

FAT TREES UTILIZING ADAPTIVE LINKS 

A fat tree is a tree in which the number of links increases at each layer above
the leaf nodes. In the above, an Adaptive Set was defined as having all its links connected
to the same node. However, the same implementation in the router also allows the links
to be connected to different destination routers. Figs. 11 and 12 depict a two-level fat tree
having three routers in each level. The routers R11, R12, and R13 in level 1 are "leaf"
routers connected to end nodes EN1, EN2, and EN3 by conventional links.

Fig. 11 depicts the up-links from level 1 to level 2. Each router in level 1
has three of its output up-links configured as an Adaptive Set. Each up-link in the
Adaptive Set is connected to a different router of level 2. Thus, unlike the
above-described embodiment, links in an adaptive set may be coupled to different routers.



Fig. 12 depicts the down links of the fat-tree. Each router in the upper 
level is connected to a router in the lower level by a single, deterministic down-link with 
no adaptivity supported. 
The result of this configuration is that traffic from end nodes is
distributed adaptively to the upper level routers while progressing upwards in the fat tree,
and is then routed deterministically when traveling in the downward direction.
Distributing traffic adaptively across the three Adaptive Set up-links of each level 1
router gives much better average link utilization than if the upward links were selected
statically based on destination ID. No matter how static partitioning is done, there is
some traffic pattern that could cause all traffic to queue for a single link to the next level
of the tree.
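
The utilization argument can be illustrated numerically. In the Python sketch below, the static rule (destination ID modulo 3) and the adaptive rule (pick the currently least-loaded up-link) are stand-ins chosen for the example; the description does not specify either policy, and the traffic pattern is hypothetical.

```python
UPLINKS = 3  # up-links per level 1 router in the two-level fat tree


def static_loads(destinations):
    """Up-link chosen statically from the destination ID (dest mod 3, assumed)."""
    loads = [0] * UPLINKS
    for dest in destinations:
        loads[dest % UPLINKS] += 1
    return loads


def adaptive_loads(destinations):
    """Up-link chosen adaptively: here, the currently least-loaded link."""
    loads = [0] * UPLINKS
    for _ in destinations:
        loads[loads.index(min(loads))] += 1
    return loads


# Adversarial pattern: every destination ID maps to the same static up-link.
traffic = [0, 3, 6, 9, 12, 15, 18, 21, 24]
static_result = static_loads(traffic)      # [9, 0, 0]
adaptive_result = adaptive_loads(traffic)  # [3, 3, 3]
```

Any other static partition has its own adversarial pattern, whereas the adaptive choice spreads the same nine packets evenly over the three up-links.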

In larger topologies, multiple Adaptive Sets can be encountered on the way 
to the destination. 

The invention has now been described with reference to the preferred
embodiments. Alternatives and substitutions will now be apparent to persons of skill in
the art. In particular, the adaptive sets are not limited to any number of links or any particular
configuration protocol. Further, fat trees may include an arbitrary number of levels with adaptive
links in different sets of uplinks between the levels. Accordingly, it is not intended to
limit the invention except as provided by the appended claims.
WHAT IS CLAIMED IS:

1. In a system area network (SAN) including a source node and a
destination node coupled by a network fabric, with the system for transferring data
between a source node and a destination node, with the network fabric coupling the
source and destination nodes including first and second routers having multiple input
ports coupled to multiple output ports by a cross-bar switch, and with the SAN
implementing data transfers as a sequence of request/response packet pair transactions,
with each request and response packet containing a header including a destination field,
and with the SAN for implementing ordered transactions requiring that packets be
received in the order transmitted and unordered transactions where packets may be
received out of order, a system for implementing adaptive sets of lanes between said first
and second routers, said system comprising:

configuration logic at said first router for configuring an adaptive set
including multiple lanes, with the control logic associating a designated input port with
the adaptive set and associating a unique output port with each lane of the adaptive set;

routing option control logic at said source node for setting adaptive control
bits in said destination field to specify whether the packet could use the routing
capabilities of the adaptive set or should be routed down a specific lane of the adaptive
set;

routing control logic at said first router, responsive to the destination field
of a packet received at said designated input port, for assigning a specific output port to
said packet, and, if said specific output port is associated with said adaptive set,
adaptively assigning a port associated with a lane in the adaptive set if the adaptive
control bits specify adaptive routing or deterministically specifying said specific output
ports if said adaptive control bits specify deterministic routing.

2. The system of claim 1 wherein:

said routing control logic includes a routing table with each entry in the
table including a bit specifying whether the entry is for an adaptive set, and if so, a field
identifying the adaptive set.






3. In a system area network (SAN) including a source node and a
destination node coupled by a network fabric, with the system for transferring data
between a source node and a destination node, with the network fabric coupling the
source and destination nodes including a router having multiple input ports coupled to
multiple output ports by a cross-bar switch, where the router may include an adaptive set
of lanes coupled to an input port where a designated output port is assigned to each lane
so that packets received at the input port may be adaptively routed on any one of the
multiple output ports assigned to the lanes of the adaptive set, and with the SAN
implementing data transfers as a sequence of request/response packet pairs, and with each
request packet containing a header including a destination field, a method for flushing
lanes in an adaptive set configured at said router, said method comprising the steps of:

at said source node, preparing a sequence of write packets with the
destination field of each packet in the sequence having adaptive control bits specifying a
different lane in an adaptive set;

at said source node, transmitting said sequence of write packets;

at said router, receiving said write packets, and, if an adaptive set is
defined, responding to the adaptive control bits of each received write packet to force said
packet to the output port specified by the adaptive control bits in the write packet.

4. The method of claim 3 further comprising the steps of:

at the source node, including a particular value in each of the write packets
and specifying a particular location at the destination node;

at the destination node, for each write packet, storing said particular value
at the specified location;

at the source node, accessing the particular locations at the destination
node and if the particular value is read from the particular locations specified by the
sequence of write packets indicating that the barrier transaction was successful.






5. The method of claim 3 further comprising the steps of:

at the router, limiting the number of lanes in an adaptive set to a specified
number;

at the source node, forming said selected number of write packets in said
sequence.

6. A routing topology comprising:

a first level including first and second first-level routers, each first-level
router having a first, second, and third input ports coupled to first, second, and third
output ports by a cross-bar switch, and with each first-level router configured to include
an adaptive set including first and second lanes, with the first input port associated with
the adaptive set and a first output port associated with the first lane and a second output
port associated with the second lane of the adaptive set, and with each first-level router
including routing logic for adaptively assigning a lane in the adaptive set to adaptively
route packets received at the first input port to first and second output ports associated
with lanes of the adaptive set;

a second level of routers including first and second second-level routers, each
second-level router having first and second input ports coupled to first and second output
ports by a cross-bar switch;

a first uplink coupling the first output port of the first first-level router to
the first input port of the first second-level router;

a second uplink coupling the second output port of the first first-level
router to the first input port of the second second-level router;

a third uplink coupling the first output port of the second first-level router
to the second input port of the first second-level router;

a fourth uplink coupling the second output port of the second first-level
router to the second input port of the second second-level router;

a source node coupled to the input port of said first first-level router; and

a destination node coupled to the third output port of said second first-level
router.



7. The routing topology of claim 6 further comprising:

a first downlink coupling the first output port of the first second-level
router to the second input port of the first first-level router;

a second downlink coupling the second output port of the first
second-level router to the second input port of the second first-level router;

a third downlink coupling the first output port of the second second-level
router to the third input port of the first first-level router; and

a fourth downlink coupling the second output port of the second
second-level router to the third input port of the second first-level router.

ABSTRACT OF THE DISCLOSURE 

Adaptive sets of lanes are configured between routers in a system area 
network. Source nodes determine whether packets may be adaptively routed between the 
lanes by encoding adaptive control bits in the packet header. The adaptive control bits 
also facilitate the flushing of all lanes of the adaptive set. Adaptive sets may also be used 
in uplinks between levels of a fat tree. 

Table 2: Physical link translation into physical lane

Figure 9: Adaptive Control Bits (ACB) Encoding Definition

000    Ordered packet delivery to lane 0 if routed to a port in an Adaptive Set
011    Ordered packet delivery to lane 3 if routed to a port in an Adaptive Set